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Abstract 

Mobile phone datasets allow for the analysis of human behavior on an unprecedented scale. The social network, 
temporal dynamics and mobile behavior of mobile phone users have often been analyzed independently from each 
other using mobile phone datasets. In this article, we explore the connections between various features of human 
behavior extracted from a large mobile phone dataset. Our observations are based on the analysis of communication 
data of 100000 anonymized and randomly chosen individuals in a dataset of communications in Portugal. We show 
that clustering and principal component analysis allow for a significant dimension reduction with limited loss of 
information. The most important features are related to geographical location. In particular, we observe that most 
people spend most of their time at only a few locations. With the help of clustering methods, we then robustly identify 
home and office locations and compare the results with official census data. Finally, we analyze the geographic spread 
of users' frequent locations and show that commuting distances can be reasonably well explained by a gravity model. 
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Abstract 

Mobile phone datasets allow for the analysis of hu- 
man behavior on an unprecedented scale. The social 
network, temporal dynamics and mobile behavior of 
mobile phone users have often been analyzed indepen- 
dently from each other using mobile phone datasets. In 
this article, we explore the connections between various 
features of human behavior extracted from a large mo- 
bile phone dataset. We show that clustering and princi- 
pal component analysis allows for a significant dimen- 
sion reduction with limited loss of information. The 
most important features are related to geographical lo- 
cation. In particular, we observe that most people spend 
most of their time at only a few locations. With the help 
of clustering methods, we then robustly identify home 
and office locations and compare the results with official 
census data. Finally, we analyze the geographic spread 



'Corresponding Author 

Email address: Vincent .traagSuclouvain. be (V.A. Traag) 
Preprint submitted to Physica A 



of users' frequent locations and show that commuting 
distances can be reasonably well explained by a gravity 
model. 



1. Introduction 

Information and communication technologies have 
always been important sources of data and inspiration in 
sociology, especially in recent decades. These technolo- 
gies influence the behavior of people, which is a subject 
of study in itself (e.g.|jl|^), but they also provide mas- 
sive amounts of data that can be used to analyze various 
aspects of human behavior. 

Telephone and mobile phone data have already been 
used to study social networks, sometimes in conjunc- 
tion with features such as gender and age |4]. More 
recently, the mobile phone data available to researchers 
have been enriched with geographical information. This 
allows to analyze regularities, or even laws[5-|7tl, gov- 
erning the highly predictable mobility ijstl in everyday 
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life. These insights can be vital in emergency situa- 
tions 10], or in (preventing) spreading of diseases Hol - 
121] or mobile viruses 1,13] . Furthermore, users' mobility 
and their social network are intertwined: the one could 
be used to predict the other fT?, Ts"], and the proba- 
bility of two people calling over a distance follows a 
gravity like model |TB-i3l- Research has also shown 
there are geographical clusters of highly connected an- 
tennas |20] (e.g. resembling provinces) as well as clus- 
ters in the social network consisting of groups of well 
connected peoplel21-23ll. although the connections be- 
tween the two are not yet fully understood [24]. Similar 
results have also been obtained in a virtual mobility set- 
ting Q. 

In this paper, we analyze anonymized communica- 
tion data from a telecom operator in Portugal. The data 
cover a period of 15 months and the following informa- 
tion is available for each communication: the times of 
initiation and termination, the users involved, and the 
transmitting and receiving antennas (at the beginning of 
the communication). In addition, we also know the lo- 
cations (longitude and latitude) of all antennas. 

We first present a statistical analysis of the data. We 
define a set of 50 general features that we compute 
for each user, and using principal component analysis 
and clustering methods, we show that these features are 
highly redundant: they can all be recovered, with a loss 
of accuracy of less than 5%, using a reduced set of only 
five meta-features. 

Observing that the most important features are geo- 
graphical, we then pay specific attention to the most 
common locations of each user By developing a pro- 
cedure to extract these frequent positions, we observe 
that people spend most of their time in only a few loca- 
tions. We then cluster the different calling patterns for 
each user and each location, and from this, we observe 
that only two types of locations are clearly identifiable, 
namely home and work. We compare our results to cen- 
sus data obtained from the Portuguese National Institute 
of Statistics. 

Finally, we analyze in more detail the behavior of 
users who have exactly one home location and one of- 
fice location. This allows us to predict the number of 
commuters between different regions of the country us- 
ing a gravity model. More precisely, we observe that 
two different regimes exist, the first involving distances 
smaller than 150 km (which is half the distance between 
the two largest cities) and the second involving larger 
distances. In the latter case, only the number of offices 
in the destination region is statistically significant. 

The fundamental contribution of this paper is that we 
improve the understanding of frequent locations. Build- 



ing on previous location inference work ll26ll . we con- 
struct a method for rigorously determining the type of 
locations. It had already been observed that people have 
only a few top locations but it remained unclear 
what type of locations they represent. Although it is 
often (tacitly) assumed they represent home and office 
(S Si), this had never been rigorously analyzed. We 
confirm this hypothesis, and also conclude that these are 
the only type of locations that are robustly detectable in 
the data. 



2. Data Mining and Feature Analysis 

In this section, we analyze the calUng and geographic 
behaviors of mobile phone users based on features that 
summarize these behaviors. These features allow us 
to investigate interdependencies between characteristics 
such as call durations, the distances of calls, the dis- 
tances of movements and the frequency of calls. This 
can be achieved, for example, by analyzing correlations 
between these features. In this section we also use prin- 
cipal component analysis and cluster analysis to better 
understand these interdependencies. 

2.1. Preprocessing 

Before proceeding with the analysis, some prepro- 
cessing of the raw data was necessary. The most impor- 
tant preprocessing step was the application of a moving 
weighted average filter on the calling positions of the 
users. 

This filtering was crucial because the position of the 
antenna does not always accurately reflect the actual po- 
sition of the user. Moreover, due to noise (such as that 
introduced by reflection and scattering in urban environ- 
ments), the closest antenna is not always the one serv- 
ing the call. Without proper filtering, these inaccuracies 
tend to accumulate, particularly for measures such as 
the total distance traveled. 

The filtering was computed as follows. The posi- 
tions were smoothed independently for all users. As- 
sume that a user made calls at times f(l), . . . , tin) and 
the coordinates of the antennas that served the calls are 
x(l), . . . , x{n). The smoothed positions of the user, de- 
noted by 3^(1), . . ■,y{n), can be calculated as 



y(0 = ^w(;)^(;) 



(1) 



with Bsii) - { ./' : |f(;') - f(OI < 5 } where Bs{i) denotes 
the indices of those calls that were initiated or received 
within a maximum interval 6 from the current time of 
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Table 1 : Selected Results of Correlation Analysis 



(a) gyration or standard deviation: the mean 
square error from the average location of 
the user (i.e., the center of the circle). 




(b) diameter of the convex hull: the maximal 
distance the user has traveled during the 
given period (the diameter of the set}. 



total line segment length: the sum of all the 
istances the user has travel between his 
calls (the time of the calls are important}. 



Figure 1: (a) The distribution of the average locations of 
the users. Brighter colors indicate areas in which higher 
numbers of users have their average locations, (b) Three 
methods of measuring customer movements: (1) gyra- 
tion, (2) diameter of the convex hull and (3) total line 
segment length. 

the filtering. We used 6 = 30min for the dataset. Posi- 
tions that were further distant from the current time of 
the decision had proportionally smaller weights: 



where / denotes the current index of the call that should 
be smoothed. 

In addition to filtering, we took into account those 
customers who had made and/or received at least 10 
calls during the period analyzed (15 months). More- 
over, for compression and cluster analysis, we normal- 
ized (scaled) and centered the data. Most of the analy- 
sis was performed on 100 000 randomly (uniformly) se- 
lected users. We performed Student's t-tests to examine 
the statistical significance of the results. 

2.2. Features 

We defined 50 features to summarize users' behavior 
Each feature represents one particular aspect of users' 
behavior as a single number, such as the number of in- 
coming or outgoing calls, the number of people who 
called or were being called by the user, the position (co- 
ordinates) of the user (mean and deviation), the coor- 
dinates of the two most frequently used antennas, the 
durations of incoming or outgoing calls (mean and de- 
viation), the distances of the incoming or outgoing calls 
(mean and deviation), the directions of the incoming or 
outgoing calls (mean and deviation) and various move- 
ment measures. 

Individual features themselves can contain consider- 
able information. For example, by analyzing the aver- 
age locations of the users, we obtained information on 
the distribution of users across the country, and large 
cities can be recognized as bright spots in the left panel 
of Fig. [U 
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.60 


.40 



In addition to a measure referred to as gyration 
we propose two additional measures of customer move- 
ments: the diameter of the convex hull and the total line 
segment length. All three of these measures rely on the 
positions of the user during calls to give some indication 
of how much or how far he has traveled. We take into 
account both incoming (received) and outgoing (initi- 
ated) calls. The sequence of the positions (calls) is not 
important for the first two measures, but it is significant 
for determining line segment length. This is illustrated 
in the right panel of Fig. [T] 

Gyration |6] measures the deviation (mean square- 
error) of each of the user's positions from his average 
location. The diameter of the convex hull measures the 
maximum distance between any two positions of the 
user during a given period. The total line segment length 
sums all of the distances between each pair of consec- 
utive positions of the user Note that the filtering pro- 
cedure explained earlier can have a large impact on this 
final measure. 

2.3. Correlation Analysis 

After computing the values of each feature for each 
user, we analyzed the interdependencies between these 
features using a correlation analysis. As mentioned ear- 
lier, we considered 100000 randomly selected users. 
We used t-statistics to confirm that our results are also 
valid for the complete dataset. In some cases, the cor- 
relations are better analyzed on a logarithmic scale, and 
we have therefore also analyzed the logarithmic corre- 
lations. 

Table [T] shows some of the correlations for the 
100000 randomly selected users. The data in this ta- 
ble shows that movement-related features are correlated 
with some but not all of the other features. Some pairs of 
features there, such as the number of calls and the num- 
ber of callers, are highly correlated as expected. Other 
pairs of features, such as the diameter of the convex 
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feature 1 



feature 1 



^ arg max E f (w'^X)^l = arg max Var \w^x] . (3) 

ll»il=i l|Hil=i 



Figure 2: Illustration of the basic concept behind (a) 
principal component analysis and (b) cluster analysis. 

hull and the average duration of a call, exhibit weaker 
correlations. Note that the correlation measures only 
the linear dependencies between two features, and more 
detailed relationships might be uncovered using more 
complex methods. We do not pursue this further here, 
but instead turn to an analysis of the redundancy of the 
data. 

2.4. Principal Component Analysis 

We now analyze the interdependencies between fea- 
tures using another approach. To what extent are the an- 
alyzed features redundant? In other words, how much 
of the information represented by one feature can be ex- 
pressed by (a combination of) other features? To ad- 
dress this question, we used principal component anal- 
ysis (PCA), which is widely used in various disciplines. 
The basic goal of PCA is to reduce the dimensions of the 
data. It can be proven that PCA provides an optimal lin- 
ear transformation for mean-square-based dimensional- 
ity reduction |27]. 

The core idea of PCA is as follow. Let xi , . . . , x„ e 
M'' be (independent) realizations of a random vector X, 
where we assume that E[X] = (which can be guar- 
anteed, for example, by substracting the sample mean 
from the measurements). In our case, the vectors (x,)"^j 
represent users, while each entry corresponds to a fea- 
ture. 

We aim at finding orthonormal vectors wi , . . . , w^/ G 
W', called the principal components, with the property 
that for all € [\,. . . ,d], the linearly transformed vec- 
tor Yk - WjX, where Wk is {w\, . . ., w^], explains the 
maximum possible variance of X. In other words, if we 
transform our dataset D - [xi, . . . , x„] by S k - WjD, 
then we can reconstruct the matrix D e R''^" from the 
matrix S k ^ K*^" (using Wk) with the smallest possible 
mean square error Note that sometimes the rows of S k, 
i.e., Si = wJD are called the principal components and 
the Wi's are referred to as loadings or coefficients. 

A recursive formulation of PCA can be given as fol- 



The vector wi points toward the direction in which the 
sample variance of the data is maximized. This is of 
course an approximation of what we would get using 
the full (unknown) distribution of X. Having defined 
the first k - I vectors, the A:-th is determined as 

ll>.H=i " ,=1 \ ;=i / 

w arg max Var |^w^xj , (4) 

||M'||=I,VV±(H.v)fr,' 

which is thus chosen to achieve the highest variance 
possible while being orthogonal (±) to the previous 
choices. The vectors (w,)j'^j can be efficiently com- 
puted from the (estimate of the) covariance matrix E - 
E ['^^^J, since vector w,-, / e {1, . . . , d], is an eigenvec- 
tor of the sample covariance matrix corresponding to its 
i-th largest eigenvalue. The basic concept behind PCA 
is illustrated in Fig.|2] 

Our PCA analysis revealed high redundancy among 
the features analyzed. Table |2] shows the results of this 
analysis. It can be seen that if we allow a 1 % (mean 
square) error in the variance, the number of features can 
be reduced by more than 50 % (from 50 to 24). If the 
allowed error is raised to 5 %, we can further reduce the 
number of features to 5, which represents a compres- 
sion rate of 90%. In other words, we can build five 
components using a linear combination of the original 
features, and using only the values of these five com- 
ponents, we can determine the values of any of the 50 
original features with a 5 % mean-square error This im- 
plies that the features have many interdependencies and 
are highly redundant. 

In order to identify which features are most relevant, 
we determined their importance as follows. PCA pro- 
duces a set of orthogonal vectors, (w,)f^i, which point 
toward the directions of maximum variance. As noted 
earlier they are eigenvectors corresponding to eigenval- 
ues of the sample covariance matrix. Furthermore, the 
i-th eigenvalue, Aj, equals to the (sample) variance of 
Si = w^D. Then, each original feature can be identi- 
fied by an element of the canonical basis. For example, 
feature 1 can be identified by ei = <1, 0, . . . , 0)^. The 
importance of feature / can then be defined as the max- 
norm of the projected vector e, on the basis defined by 



Table 2: Compression of Features by Principal Component Analysis 



Variance Kept 


Mean Square Err 


Dimen. Required 


Compress. Rate 


99% 


1% 


24 


52% 


98% 


2% 


13 


74% 


95% 


5% 


5 


90% 



{wilAiY-^y The basis was thus scaled to produce larger 
coordinates in directions of higher variances. Note that 
we have also scaled the scores such that the most impor- 
tant feature has a relative importance of 1 . 

Fig Spresents a list of the features in order of impor- 
tance. The most important features, such as the average 
position of the user and the coordinates of the two most 
often used antennas, are geographic features. This in- 
dicates that the locations of the users and their calls are 
among the most important characteristics. 

2.5. Cluster Analysis 

After analyzing the data using PCA, we performed 
cluster analysis, to identify typical user classes based 
on calling behaviors. We used the subtractive cluster- 
ing method illustrated in Fig.[2l which is a variant 
of the classical mountain method. An advantage of the 
subtractive clustering method is that it can identify the 
number of clusters required. 

The application of subtractive clustering to the nor- 
malized data for 100000 uniformly selected customers 
resulted in 5 clusters. Each of these clusters is identi- 
fied by its central element (a vector of feature values) 
and its range of influence. As with the PCA, we wished 
to identify the main constituent features of these clus- 
ters. We therefore performed a similar analysis as for 
the PCA, using the vectors of the cluster centers as the 
basis for the dominant feature subspace. The results of 
this ordering, presented in Fig.H] indicate that location- 
and movement-related features are important character- 
istics, similar to PCA. 

Note that both PCA and cluster analysis reveal clear 
differences in importance between the x and y coordi- 
nates. This can be expected for an elongated country 
such as Portugal and is probably aggravated by the fact 
that most people live along the coast. 

Although the features concerning the y coordinates 
have a similar importance in both the PCA and cluster 
analysis, there are some clear differences as well. In 
general, an explanation of this could be that PCA fo- 
cuses on global characteristics, it tries to build compo- 
nents (by linear combination of feature vectors) which 



can explain the dataset with minimal mean square er- 
ror Clustering, however, concentrates on local simi- 
larities, and tries to find clusters in which the feature 
vectors are "close" to each other. Nevertheless, various 
geographical features have key importance according to 
both orderings. The most notable difference concerns 
the diameter of the convex hull, which has a very high 
importance in clustering, while it has a relatively low 
importance in PCA. From the PCA analysis this implies 
that the variance in the diameter of the convex hull is not 
important for explaining a large part of the data. From 
the cluster analysis, the differences in the diameter of 
the convex hull are important, even though the variance 
might not contribute that much. This suggests an inter- 
esting effect of the diameter of the convex hull. Besides 
the obvious importance of the y position when cluster- 
ing people, the diameter of the convex hull separates 
people that share similar y positions. In conclusion, the 
features that are important for clustering people are: (1) 
first antenna; (2) second antenna; (3) diameter of convex 
hull; and (4) average position. 

3. Frequent Locations 

In the previous section, we analyzed several features 
and concluded that the most important ones are related 
to geography. Additionally, we observed that most peo- 
ple spend most of their time in only a few locations. 
In this section, we focus on characterizing these fre- 
quent locations for each user by analyzing weekly call- 
ing patterns. Once these frequent locations are charac- 
terized, we analyze them in greater depth. A related, 
although different, concept of habitats ll29ll was recently 
introduced, where habitats are clusters of the associated 
Markov mobility network. However, a single habitat 
might contain several frequent locations. 

As explained in Section ITTl the data are noisy, and 
often, any one of multiple antennas can be used to make 
a call from a given position. Because this can be true 
for frequent locations such as home and the office, we 
first develop a method for estimating which antennas are 
relevant for characterizing such locations. 
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Figure 3: The relative importance of each feature according to PCA, defined as the max-norm of the projection of the 
feature basis on the principal components. 
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Figure 4: The relative importance of each feature according to cluster analysis. 



After extracting the frequent locations for each user, 
we estimate these positions more precisely using a max- 
imum likelihood approach. We then present various 
statistics using these estimated positions. In particular, 
we estimate the amount of time people spend at work 
and home, characterize different combinations of fre- 
quent locations (multiple 'homes' or 'offices'), estimate 
the geographical density of homes and offices, compare 
our estimates to independent statistics and, finally, an- 
alyze distances between home and office (commuting 
distances). 

3.1. Detection of Frequent Locations 

Detecting the most common locations of a user is 
only possible if enough calls involving that user are 
recordecfl For users who make only a few calls, no 
locations can be called "frequent" with any certainty. 
We therefore selected only users who make at least one 
call a day on average and who make consecutive calls 
within 24 hours 80% of the time. The latter constraint 



' In this section, we include both calls and text messages because 
we want to maximize the information on antenna usage; we refer to 
both as "calls". 



requires a certain regularity of users, and excludes users 
with highly bursty behavior|0]. From this selection, we 
selected a random sample of 100 000 users. 

For detecting frequent locations, it is appropriate to 
begin by identifying the most frequently used antenna 
(MFA). However, as stated earlier, the same antenna is 
not always used for calls made from a given position 
(due to load balancing or the effects of noise on the sig- 
nal). Hence, other antennas located near the MFA may 
also be used to serve the frequent location. We must 
therefore consider sets of antennas that are relatively 
close together 

We first performed a Voronoi tessellation, which par- 
titions the space into cells based on the distance between 
each point and the closest antenna. Each Voronoi cell 
includes the set of points that are closer to the antenna 
located in that cell than to any other antenna. Based on 
the Voronoi tessellation, a graph can be created in which 
nodes are neighbors if their associated Voronoi cells are 
adjacent. Each node corresponds to an antenna, and its 
neighbors are called the Delaunay neighbors. 

We next grouped antennas around the MFA based 
on Delaunay neighborship. More precisely, we defined 
the Delaunay radius of each antenna to be the largest 
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Figure 5: Histogram of the number of frequent locations 
per user 

distance between an antenna and any of its Delaunay 
neighbors (this is later used in the estimation of the po- 
sition; see Section l3.3.2l for more information). We then 
merged all antennas around the MFA that are within 
twice this radiu^ and assigned them one "location", the 
position of which will be defined later 

After identifying the first MFA and merging the sur- 
rounding antennas, we moved on to the remaining an- 
tennas, selecting the most frequently used of those and 
repeating the procedure described above. We continued 
iterating until we identified a set of antennas that repre- 
sented less than 5% of a user's calls. We repeated this 
for each user in our selection and thus obtained a num- 
ber of frequent locations for all users. 

The results of this procedure are summarized in Fig.|5] 
and indicate that the average number of frequent loca- 
tions per user is approximately 2.14 and that 95% of the 
users have fewer than 4 frequent locations. This implies 
that the 3 or 4 most common locations are sufficient to 
predict the position of user, most of the time iQl- A sub- 
stantial number of users have only one single frequent 
location, which is usually an office or a home location 
(as we will see later on). This could reflect the pos- 
session of separate business and private phones, one of 
which is (almost) exclusively used at work and the other 
only at home. 

3.2. Clustering of Weekly Calling Patterns 

The data show two clearly identifiable periodic dy- 
namics in mobile phone use: a daily cycle and a weekly 
cycle, as illustrated in Fig. |6] The daily cycle largely 
follows the human circadian rhythm, with a clear drop 
in activity during the night, a gradual increase in the 
morning and a decrease in the evening, with a small dip 
around lunch time. The weekly dynamic is related to 



-We observe that taking twice tlie Delaunay radius yields an error 
of less than 0. 1% for estimating positions. See Section [3.3.2l for more 
information. 



the workweek, with different behavior on weekends as 
compared to work days. 

We collected all of the calls made using antennas as- 
sociated with each frequent location. Because we have 
the time stamps (beginning and end) of each call, we 
know the times at which each frequent location is used. 
The description of this usage at the weekly scale seems 
to be especially suitable for further analysis. We there- 
fore divided the week into 168 hours and aggregated 
the usage pattern of the whole period. This resulted in a 
168 -dimensional vector per frequent location with the 
calling frequency for one hour in each entry. 

Based on the aggregated call vectors for all frequent 
locations, we performed k-means clustering. We ran 
this clustering for k - {2, . . . , 10) to investigate what 
patterns of usage could be distinguished. We found that 
using k - 3 yielded clear results, as displayed in Fig.|6] 
The first cluster clearly represents a pattern related to 
work. During the weekdays, an increase in the usage of 
these antennas occurs during the morning, followed by 
a small dip around noon, and a decrease in usage from 
around 6 p.m. on. During the weekend, these anten- 
nas are used far less. This pattern is in excellent agree- 
ment with independent statistics from the Portuguese 
National Institute of Statistics (INE) in terms of time 
spent at work, as shown in Fig.|6] The second cluster re- 
flects a pattern of usage that appears to be more closely 
associated with a home position. The usage of these 
antennas is lower during the day, and the maximum us- 
age occurs during the evening. These antennas are also 
used more during the weekend than are the antennas in 
the first cluster. Finally, the third cluster appears simply 
to contain locations that do not follow the dynamics of 
the previous two clusters. This cluster follows the more 
general dynamic displayed in Fig.|6] 

We observed that when more than three clusters are 
considered, they tend to yield results very similar to 
those shown here. We expected that we would be able 
to identify additional patterns of usage, such as those 
of calls made by students with a different rhythm from 
working people or calls made from weekend houses that 
show no activity during the week, but we did not ob- 
serve these patterns. Such patterns certainly do exist, 
but they appear to be marginal when compared to the 
established home and office routine. Hence, there ap- 
pears to be no identifiable patterns of usage other than 
the home and office patterns described above. How- 
ever, using only two clusters obfuscates this result, and 
the separation between home and office positions is less 
clear in this case. 

The top 10 most frequent combinations of frequent 
locations are displayed in Table[3] Approximately 32% 
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Figure 6: Weekly dynamics on average (a) and for 
the three clusters (b) detected using k-means cluster- 
ing: home, office and the remainder, shown along with 
independent time usage statistics from the Portuguese 
National Institute of Statistics (INE) 1,30J . Using more 
clusters yields similar results. The dotted lines indicate 
noon of each day. 
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Table 3: The 10 most frequent combinations of frequent 
locations. Each combination is composed of the num- 
ber of homes, offices and unidentified locations a user 
has. Each row indicates such a combination. The empty 
entries indicate no such type of location is present in a 
combination. The last column contains the percentage 
of how often such a combination occurs. 
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Figure 7: Histogram of the number of frequent locations 
per user, with visual separation between "home" loca- 
tions, "office" location and mixed locations (i.e., users 
with one of each) 



of the users have either a single home location or a sin- 
gle office location alone, whereas only 3.5% have only a 
single unidentified location. For users with two frequent 
locations, the most common combination is one home 
location and one unidentified location. Only 6.6% of all 
users have the combination of one home location, one 
office location and no unidentified locations. Approxi- 
mately 85% of the users have at most one home and/or 
one office location, and approximately 12% of the users 
have exactly one home and one office location (and pos- 
sibly multiple unidentified locations). 

Of all frequent locations, approximately 60% ca be 
classified as "home" or "office" (as in the first two 
columns of Table [3]l. We observed that users tend to 
have no more than two identifiable positions, as de- 
picted in Fig. |2l The majority of users have only one 
identifiable location, which is by definition either home 
or office. For users with two identifiable locations, over 
50% have both a home and an office, and the rest has 
either two homes or two offices. 

3.3. Estimating the Position of Frequent Locations 

3.3.1. Basic model 

We propose a model to estimate the position of the 
home, the office and other frequent locations. We con- 
sider a simplified version of the model proposed in 12611 . 
which was also used in IJlll . The underlying idea is that 
users connect to the antenna that has the highest signal 
strength, which is not necessarily the closest antenna. 

We begin by estimating the total signal strength of an 
antenna / at a certain position x. We assume, similarly 
to 1 26], that the total signal strength consists of three 
components: the power of the antennas, the loss of sig- 
nal strength over distance and some stochastic fading of 
the signal due to scattering and reflection in the environ- 
ment. Specifically, we use the following parameters. 
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The position of antenna / is denoted by X;. 



• The power is denoted by /?,-, and we assume this 
to be constant and equal for all antennas because 
we have no information regarding the power of the 
antennas. Therefore, pj = p for all /. 

• The loss of signal at position x for antenna / is mod- 
eled as 

L,(x) = ^- — -, (5) 

where j8 is a parameter indicating how quickly the 
signal decays. 

• The so-called Rayleigh fading of the signal from 
antenna / can be modeled by a unit mean exponen- 
tial random variable Ri 11321] . for which the cumu- 
lative distribution function (cdf) is 

Pr{Ri < r) = F{r) = 1 - e'' . (6) 

Furthermore, we assume all to be independent. 

The total signal strength Si(x) of antenna / at location x 
is then modeled as 



Siix) = pLi{x)Ri, 



(7) 



and we model the probability that a user at position x 
connects to antenna /, Pr(a - i\x), as the probability that 
the signal strength of antenna / is larger than that of any 
other antenna: 

Pr(a = i\x) = PriSiix) > S j(x), Vj) 

^Y]PT(Si(x)>Sj(x)). (8) 

This probability density is displayed in Fig.|8] 

This probability can be seen as a smoothed Voronoi 
tessellation, in which a user will always connect to the 
closest antenna, by taking the limit of — » oo. In 
that case, we are essentially considering the situation 
in which the path loss is dominant over the Rayleigh 
fading. Hence, little noise is involved, and whenever a 
closer antenna exists, it will be used. 




Figure 8: Probability density Pr(fl = i\x) (represented 
by topographic curves) for a particular antenna / (central 
black 'X'), with neighboring antennas (red 'X's) and the 
local Voronoi tessellation (dark lines) also shown. The 
probability density can be seen as a smoothed Voronoi 
tessellation in which there is a small probability of con- 
nection to antenna / when the user is in another Voronoi 
cell. 



3.3.2. Antenna neighborhoods 

As mentioned in the previous section, the probability 
that a user will connect to a specific antenna depends 
on the position of other nearby antennas. The relevant 
set of antennas X can be rather large, which can slow 
down the computation of the probabilities. Using a lo- 
cal approximation might accelerate this process without 
affecting the results. 
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The idea of using a local approximation is tied to the 
decreased probability that a call will be linked to an an- 
tenna that is far away. Only some of the antennas around 
a given position are in fact relevant. It therefore seems 
natural to construct local neighborhoods of antennas so 
as to make the method more efficient without introduc- 
ing any significant error. 

We define the neighborhood X, and the domain D, of 
antenna / to consist of the smallest circle enclosing at 
least all of the Delaunay neighbors (and possibly more). 
As mentioned previously, the Delaunay neighbors are 
those antennas located in adjacent Voronoi cells. 

• For each antenna, we select all Delaunay neighbors 
and then select the maximum distance between the 
focal antenna and any of these neighbors: 

Pi - max{(i(X,,Xj)|y Delaunay neigh, of /), (9) 

where d(Xi,Xj) is the distance between antenna / 
and 



We then define the domain 



Di ^ {x\\\x - X,\\ < 6pi] 



(10) 



as the region within radius 6pi, where <5 is a scaling 
factor. We observe that choosing 6-2 leads to 
an error of less than 0.1% in the computation of 
Pr(fl - i\x) comparecd to using the entire set X. 

• Finally, the set of Delaunay neighbor^ is taken as 
all antennas within this region: 

Xi^{j\XjeDifoijeX}. (11) 

Note that this set contains at least all of the Delau- 
nay neighbors and may also contain other anten- 
nas. 

Finally, using equation (|8]l, we approximate the prob- 
ability as 

Pr(fl = i\x) ^ Y\ PfiSiix) > Sj(x)), (12) 

leading to a large reduction in the computational time 
required. 



3.3.3. Maximum Likelihood Estimation 

We use the model explained above to more accurately 
estimate the position of each frequent location. For each 
such location, we know the number of calls fe, made us- 
ing antenna /. The probability that kj calls were made 
using antenna / given position x is then Pr(fl = /|x)'^' . 
Hence, the log likelihood of observing call frequencies 
k for the antennas in Xf, where / is the MFA of a fre- 
quent location, for a certain position x is 

log J:(x\k) = 2 ki log Pr(fl = i\x). (13) 

ieXf 

The maximum likelihood estimate (MLE) x of the posi- 
tion of a frequent location is then given by 



X = argmaxlogi](x|fe). 



(14) 



To find the MLE, we employ a derivative-free op- 
timization scheme because the gradient of the likeli- 
hood function is costly to evaluate. In particular, we 
use the Nelder-Mead algorithm fl33ll . initialized with the 
weighted average position of the antennas associated 
with the frequent location. The distance between the 
average position of the antennas and the MLE is 1 .7 km 
on average and reaches a maximum of approximately 
35 km. This shows that although using the average po- 
sition provides a reasonable approximation, it is not al- 
ways accurate. 

3.4. Results 

We now analyze the results of the position estimation. 
First, we present our results concerning the geographi- 
cal distribution of frequent locations around the country 
and compare these results to independent statistics. We 
then analyze commuting distances, i.e., the distances of 
travel between home and office, and develop a model of 
the number of commuters between each pair of coun- 
tiell. 

3.4.1. Population density estimation 

The position estimates of all frequent locations can 
be used to analyze the population distribution through- 
out the country. Using the county level data, we counted 
the number of home locations for each county. We then 
compared these results to population density data ob- 
tained from the Instituto Nacional de Estatistia^ (INE). 



average error based on 1000 random points 
'*To deal with antennas near the border of the country (for which 
the Delaunay neighbors can be far away), we take this border into 
account, and create a sUghtly different neighbor set. 



^We used the NUTS-3 data defined by Eurostat, which, in the case 
of Portugal, consists of groups of municipalities; we refer to these as 
"counties" for simplicity. 

*http://www.ine.pt 
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Figure 9: (a) Population sizes per county throughout the country (based on statistics from INE), (b) estimated number 
of homes per county, and (c) the distribution of all frequent locations. Lighter colors indicate higher values. 



As shown in Fig. |9l there is a strong correspondence 
between the INE population data for each county and 
our estimate. The correlation between the two is 0.92. 
This indicates that we can accurately estimate popula- 
tion size based on the mobile phone data. A more ac- 
curate density plot of the frequent locations is shown 
in Fig. |9l which illustrate that these locations are con- 
centrated in the cities. A comparison of Fig. |9]to the 
distribution of the average positions of users over the 
entire period (Fig. [U, shows that the distribution of fre- 
quent locations is more pronounced. Average positions 
are likely to be distorted by commutes and to interpolate 
between home and office. 

3.4.2. Commuting distances 

The home and office positions determined above can 
be used to estimate commuting distances. For individu- 
als who have more than one home or one office, multiple 
commuting distances could be calculated, but it would 
be unclear which distance is the "correct" one. There- 
fore, for this analysis, we considered only the 12% of 
users who have exactly one home and one office (and 
possibly some unidentified frequent locations). This 
means that each user considered has exactly one com- 
muting distance. These commutes are plotted in Fig.fTOl 
with smaller distances indicated in brighter colors. Two 
things stand out on this map. First, the two largest cities 
in Portugal, Porto and Lisbon, are clearly discernible. 
Second, most of the cities appear to predominantly at- 
tract people living in the immediate surroundings. 

The distribution of commuting distances depicted in 
Fig. [TT] appears to be affected by the location of Porto 
and Lisbon. Two different regimes can be discerned: 




Figure 10: Commute map for our sample of users. 
Brighter colors indicate smaller commuting distances. 
Most of the commutes cover only small distances, al- 
though some commutes span half the country. The num- 
ber of commutes decays approximately log-normally 
with distance. 
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Gammuting Distance 

Figure 1 1 : The distribution of commuting distances re- 
vealed by the analysis of mobile phone data. These dis- 
tances exhibit a log-normal distribution for d < 150 km 
(blue line). The full distribution is shown, with the line 
at 150 km separating the two different regimes. These 
two distinct regimes probably arise because almost all 
of continental Portugal is within 150 km of either Porto 
or Lisbon. 



one regime reflecting commuting distances of less than 
150 km and the other reflecting larger distances. This 
coincides with the distance between Lisbon and Porto, 
which is approximately 300 km. In fact, most of Por- 
tugal is within 150 km of one of these two cities. This 
suggests that most people tend to work no further away 
than the closest largest city, i.e., it is unlikely that people 
living near Porto work in Lisbon. The set of commuting 
distances that are less than 150 km can be reasonably 
well fitted using a log-normal distribution with parame- 
ters fi - 2.35 and cr - 0.94, as displayed in Fig.fTTI 

A common model for analyzing commuting distance 
is the gravity model ll 1113411 . although recently another 
parameterless model has been suggested [35] . This 
model formulates the number of trips w/j made between 
two locations / and j as proportional to the population 
sizes at the origin P, and at the destination Pj, with some 
decay, depending on the distance djj between / and j. 
More precisely, the model is formulated as 



pa p/^ 

' J 

7(d~y 



(15) 



where f(dij) is usually taken as either a power law dJj 
or an exponential decay e^"'^ with parameters a, /3 and 



y to be estimated from the data. 

Here, we formulate the gravity model in terms of the 
number of trips (commutes) made between county / and 
county j. Instead of simply considering the population 
size as f , and Pj, we can take into account our previous 
calculations of the distributions of both home positions 
and office positions. The probability of a trip from ; 
to j can then be formulated in terms of the number of 
home locations at the origin H, and the number of office 
locations at the destination Oj. 

Again, we discern two regimes: a close-range regime 
with djj < 150 km and a long-distance regime with 
djj > 150 km. Fitting both the power law decay and 
the exponential decay, we find that the power law decay 
provides a slightly better fit. The results are displayed 
in Fig. [12] and in Table |4] Interestingly, the decay dis- 
tance parameter y for large distances is not significant, 
suggesting that for distances dij > 150 km, the num- 
ber of trips no longer depends on the actual distance. In 
fact, the only coeflicient that is significant for large dis- 
tances is the coefficient of the number of offices at the 
destination. Thus, for larger distances, only the number 
of work opportunities at the destination appears to be 
important. 

The fit of the model is better when the numbers of 
home and office locations per county are used than when 
the population sizes are used. As shown in Table [H the 
values of for the two regimes are 0.52 and 0.26, re- 
spectively, when the numbers of home and office loca- 
tions are used, compared to 0.43 and 0.24, respectively, 
when population sizes are used. Hence, it is worth tak- 
ing into account the numbers of offices and homes when 
modeling commuting distances instead of simply using 
population size as an approximation for both. In the 
present case, the model slightly overestimates the num- 
ber of shorter commutes, indicating that there is room 
for improvement. This deviation might be due to the 
aggregation of information at a small resolution. On the 
other hand, this might also be due to a real effect: dis- 
tances below some threshold have no efi'ect. In this case, 
trips under about 2 km should be almost unaffected by 
distance. Higher resolution data is needed to investigate 
this in more detail. 

4. Conclusion 

In this study, we analyzed the behavior of mobile 
phone customers based on their calling habits. We first 
sampled 100000 customers randomly and filtered their 
locations, as these are based on associated antenna lo- 
cations, which are subject to disturbances. We then de- 
fined and computed 50 features that describe the call- 
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Coefficient Variable 



djj < 150 km djj > 150 km 



Numberof homes at origin 0.17" + 0.013 0.018 + 0.013 

Numberof offices at destination 0.21" + 0.013 0.030* + 0.012 

Distance 0.37" ±0.018 0.13+0.11 

0.52 0.26 

(exponential fit) 0.46 0.26 



a 

y 



Table 4: Fitted parameters and ^ of the gravity model with f{dij) - dip , with standard errors reported. We also 
report R^ for the exponential fit /(<i,/) = e''"''', which is slightly worse. ** p < 0.001, * p < 0.05. 
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Figure 12: Plots of the prediction ratio Wjj/wij for commuting distance of d < 150 km (left panels) and d/j > 150 km 
(right panels) for (a) the power law decay f{dij) - d]., and (b) the exponential decay f{dij) - e^''" . Red squares 
indicate mean values, and blue circles indicate medians. 
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ing behaviors of the customers. We performed a cor- 
relation analysis on these features, which showed that 
movement- and location-related features are correlated 
with many other features. We then analyzed the data us- 
ing principal component analysis (PCA). This showed 
that the original features are highly redundant and can 
be efficiently compressed if some reconstruction error 
(e.g., 5 %) is allowed. We also performed a cluster anal- 
ysis and that revealed a small number of typical user 
classes. We computed the relative importance of each 
feature in the PCA and the cluster analysis and found 
that location- and movement-related features are espe- 
cially important in both cases. We therefore analyzed 
the users' most common locations. 

We clustered these frequent locations based on 
weekly calling patterns and found that only home and 
office locations could be clearly identified. Other pat- 
terns of usage (such as use from weekend houses) are 
surely present in the data, but these are marginal when 
compared to the clear pattern of use from home and of- 
fice locations. We characterized the number of frequent 
locations for each user and the most common combi- 
nations of frequent locations (e.g., multiple houses or 
offices). Finally, we estimated the positions of frequent 
locations based on a probabilistic inference framework. 
Using these positions, we derived a fairly accurate esti- 
mate of the distribution of the population, which showed 
a correlation of 0.92 with independent population statis- 
tics. These positions also allowed us to analyze com- 
muting distances, and we found that the data are rea- 
sonably well explained by a gravity model. This model 
works better when the numbers of homes and offices are 
considered instead of population sizes. This indicates 
that when analyzing commuting distances, it is worth 
taking the distribution of home and office location into 
account. 

The present study represents an exploratory analy- 
sis of the data. Further research into the frequent lo- 
cations and associated user behavior should be under- 
taken. This data set contains both geographical data and 
social network data, and it would be interesting to fur- 
ther analyze the interaction between the two. 
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